NOTE: The exercise numbering here builds upon the previous activity.
Goals
Multivariate models are great! We’ve seen that adding more predictors to a model increases R-squared (the strength of the model in explaining the response variable) and allows us to control for important covariates. BUT there are nuances and limitations we must consider before indiscriminately adding more predictors to our model.
Load the following data on penguins!
# Load data
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(palmerpenguins)
library(magrittr)
data(penguins)
(Art by @allison_horst)
You can find a codebook for these data by typing ?penguins in your console (not Rmd). Our goal throughout will be to build a model of bill lengths (in mm):
NOTES:
What has been your favorite class at Mac?
The flipper_length_mm variable currently measures flipper length in mm. Create a new variable named flipper_length_cm which measures flipper length in cm. Store this inside the penguins data. Hints:
penguins %<>% mutate(flipper_length_cm = flipper_length_mm / 10)
penguin_model_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins)
penguin_model_2 <- lm(bill_length_mm ~ flipper_length_cm, penguins)
penguin_model_3 <- lm(bill_length_mm ~ flipper_length_mm + flipper_length_cm, penguins)
penguin_model_4 <- lm(bill_length_mm ~ body_mass_g, penguins)
penguin_model_5 <- lm(bill_length_mm ~ flipper_length_mm + body_mass_g, penguins)
What can a penguin’s flipper (arm) length tell us about their bill length? To answer this question, we’ll consider 3 of our models:
| model | predictors |
|---|---|
| penguin_model_1 | flipper_length_mm |
| penguin_model_2 | flipper_length_cm |
| penguin_model_3 | flipper_length_mm + flipper_length_cm |
Construct plots of bill_length_mm vs flipper_length_mm and of bill_length_mm vs flipper_length_cm. To avoid repeating code, first define a helper function:
point_and_line <- function(x){
  x +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
}
penguins %>%
ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
summary(penguin_model_1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2648678 3.20015684 -2.27016 2.382330e-02
## flipper_length_mm 0.2547682 0.01588914 16.03410 1.743974e-43
penguins %>%
ggplot(aes(x = bill_length_mm, y = flipper_length_cm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
summary(penguin_model_2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.264868 3.2001568 -2.27016 2.382330e-02
## flipper_length_cm 2.547682 0.1588914 16.03410 1.743974e-43
Before looking at any output, ask your gut: Will the penguin_model_2 R-squared be less than, equal to, or greater than that of penguin_model_1? How will the penguin_model_3 R-squared compare to that of penguin_model_1?
All three R-squared values are the same because they all reflect the same correlation: multiplying the data by a constant doesn’t change the fit of the line.
Check your gut: Report the R-squared values for the three penguin models and summarize how these compare.
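One way to pull these values, assuming the three models above have already been fit (a sketch):

```r
# Extract the R-squared from each model summary
summary(penguin_model_1)$r.squared
summary(penguin_model_2)$r.squared
summary(penguin_model_3)$r.squared
```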
Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 predictors: flipper_length_mm vs flipper_length_cm.
penguins %>%
ggplot(aes(x = flipper_length_mm, y = flipper_length_cm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
In summary(penguin_model_3), the flipper_length_cm coefficient is NA. Explain why this makes sense. HINT: Thinking about yesterday’s activity, why wouldn’t it make sense to interpret this coefficient? BONUS: For those of you that have taken MATH 236, this has to do with model matrices that are not of full rank!
Because the cm data is just a scaled duplicate of the mm data, it contributes nothing new to the model, so a coefficient cannot be estimated for it.
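The redundancy can also be confirmed numerically (a sketch): the two flipper length variables are perfectly correlated, so one carries no information beyond the other.

```r
# cm is mm / 10, so the correlation between the two versions is exactly 1
cor(penguins$flipper_length_mm, penguins$flipper_length_cm,
    use = "complete.obs")
```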
Incorporating body_mass_g. In this exercise you’ll consider 3 models of bill_length_mm:
| model | predictors |
|---|---|
| penguin_model_1 | flipper_length_mm |
| penguin_model_4 | body_mass_g |
| penguin_model_5 | flipper_length_mm + body_mass_g |
Which is the better predictor of bill_length_mm: flipper_length_mm or body_mass_g? Provide some numerical evidence.
summary(penguin_model_4)$r.squared
## [1] 0.3541557
Flipper length seems to be a better predictor of bill length than body mass because the R-squared of model 1 is higher than that of model 4.
penguin_model_5 incorporates both flipper_length_mm and body_mass_g as predictors. Before examining a model summary, ask your gut: Will the penguin_model_5 R-squared be close to 0.35, close to 0.43, or greater than 0.6?
Check your intuition. Report the penguin_model_5 R-squared and summarize how this compares to that of penguin_model_1 and penguin_model_4.
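One way to make this comparison, assuming the models above are already fit (a sketch):

```r
# Compare the three R-squared values side by side
summary(penguin_model_1)$r.squared  # flipper length only
summary(penguin_model_4)$r.squared  # body mass only
summary(penguin_model_5)$r.squared  # both predictors
```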
Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 predictors: flipper_length_mm vs body_mass_g.
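One way to build this plot, reusing the point_and_line helper defined earlier (a sketch):

```r
# Plot the two predictors against each other to check for association
penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) %>%
  point_and_line
```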
The exercises above have illustrated special phenomena in multivariate modeling:
two predictors are redundant if they contain the same exact information
two predictors are multicollinear if they are strongly associated (they contain very similar information) but are not completely redundant.
Which penguin model had redundant predictors and which predictors were these?
Which penguin model had multicollinear predictors and which predictors were these?
In general, what happens to the R-squared value if we add a redundant predictor into a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?
Similarly, what happens to the R-squared value if we add a multicollinear predictor into a model?
Not only does the strategy of adding more and more predictors complicate our models with diminishing returns, it can also give us silly results. Construct 2 models using a sample of 10 penguins (for illustration purposes only). NOTE: If you’re using an older version of R, you might get a different sample from others, but the ideas are the same!
# Take a sample of 10 penguins
# We'll discuss this code later in the course!
set.seed(155)
penguins_small <- sample_n(penguins, size = 10) %>%
mutate(flipper_length_mm = jitter(flipper_length_mm))
# 2 models
poly_mod_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins_small)
poly_mod_2 <- lm(bill_length_mm ~ poly(flipper_length_mm, 2), penguins_small)
# 2 R-squared
summary(poly_mod_1)$r.squared
## [1] 0.7341412
summary(poly_mod_2)$r.squared
## [1] 0.7630516
# Plot the 2 models
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
Adding a quadratic term to the model increased R-squared quite a bit. BUT! I! WANT! MORE! Adapt the code above to construct a model of bill_length_mm by poly(flipper_length_mm, 9) which has 9 polynomial predictors: flipper_length_mm, flipper_length_mm^2,…, flipper_length_mm^9.
Report the model’s R-squared and construct a visualization of this model (show the model trend on top of the scatterplot of raw data).
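Adapting the code above, one possible sketch (with poly_mod_9 as an assumed name):

```r
# A 9th-order polynomial model of the 10 sampled penguins
poly_mod_9 <- lm(bill_length_mm ~ poly(flipper_length_mm, 9), penguins_small)
summary(poly_mod_9)$r.squared

# Overlay the 9th-order trend on the scatterplot of raw data
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE)
```

With only 10 data points and 9 polynomial coefficients (plus an intercept), the fitted curve can pass through essentially every point, so expect an R-squared of (essentially) 1.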
OK. We’ve learned that it’s always possible to get a perfect R-squared of 1. However, our example here demonstrates the drawbacks of overfitting a model to our sample data. Comment on the following:
Does this model capture the true relationship between bill_length_mm and flipper_length_mm? Would this model generalize well beyond our penguins_small sample? Zooming out, explain some limitations of relying on R-squared to measure the strength / usefulness of a model.
Check out the image from the front page of the manual. Which plot pokes fun at overfitting?
There are so many models we could build! For each possible research goal, indicate which predictors you’d include in your model.
i. We want to understand the relationship between bill length and depth when controlling for penguin species.
ii. We want to be able to predict a penguin’s bill length from its flipper length alone (because maybe penguins let us get closer to their arms than their nose?).
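One possible set of models matching these goals (a sketch only; the choice of predictors is the point of the exercise):

```r
# i. relationship between bill length and depth, controlling for species
goal_i  <- lm(bill_length_mm ~ bill_depth_mm + species, penguins)

# ii. predict bill length from flipper length alone
goal_ii <- lm(bill_length_mm ~ flipper_length_mm, penguins)
```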
Aren’t you so curious about penguins? Identify one new question of interest, and explore this question using the data. It can be a simple question and answered in 1 line / 1 set of lines / 1 plot of R code, so long as it’s not explored elsewhere in this activity.
We’ve seen that, unless a predictor is redundant with another, R-squared will increase. This holds even if that predictor is strongly multicollinear with another, and even if it isn’t a good predictor! Thus if we only look at R-squared, we might get overly greedy. We can check our greedy impulses in a few ways. We take a more in-depth approach in STAT 253, but one quick alternative is reported right in our model summary() tables. Adjusted R-squared includes a penalty for incorporating more and more predictors. Mathematically:
\[\text{Adjusted } R^2 = R^2 - (1-R^2)\frac{\text{number of non-intercept coefficients}}{\text{sample size} - \text{number of non-intercept coefficients} - 1} = R^2 - \text{penalty}\]
Thus unlike R-squared, Adjusted R-squared can decrease when the information that a predictor contributes to a model isn’t enough to offset the complexity it adds to that model. Consider two models:
example_1 <- lm(bill_length_mm ~ species, penguins)
example_2 <- lm(bill_length_mm ~ species + island, penguins)
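One way to compare the two metrics, assuming the example models above are fit (a sketch):

```r
# R-squared vs Adjusted R-squared for each example model
summary(example_1)$r.squared
summary(example_1)$adj.r.squared
summary(example_2)$r.squared
summary(example_2)$adj.r.squared
```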
Check out the summaries for the 2 example models. In general, how does a model’s Adjusted R-squared compare to the R-squared? Is it greater, less than, or equal to the R-squared?
How did the R-squared change from example model 1 to model 2? How did the Adjusted R-squared change?
Explain what it is about island that resulted in a decreased Adjusted R-squared. Note: it’s not necessarily the case that island is a bad predictor on its own!
Practice making some plots of the above models. And maybe even practice interpreting coefficients!